Qwen3-VL is the most powerful vision-language model in the Tongyi series. It offers strong text understanding and generation, in-depth visual perception and reasoning, long-context support, strong understanding of spatial relationships and video dynamics, and agent interaction capabilities. This repository provides the weights in GGUF format, which supports efficient inference on CPUs, GPUs, and similar devices.
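Before loading a downloaded weight file into an inference runtime, it can help to confirm that it really is a GGUF file. The sketch below is only an illustration, not part of this repository: it reads the fixed-size GGUF header (the 4-byte magic `GGUF`, then a version and two counts) and assumes a little-endian file with GGUF version 2 or later. The function name `read_gguf_header` and the file name in the usage comment are hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"  # the first four bytes of every GGUF file


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF header.

    Assumes a little-endian file with GGUF version >= 2, where both
    counts are stored as unsigned 64-bit integers.
    """
    with open(path, "rb") as f:
        header = f.read(24)  # 4 (magic) + 4 (version) + 8 + 8 (counts)
    if len(header) < 24 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count


# Usage (hypothetical file name; substitute the file you downloaded):
# version, tensors, kv_pairs = read_gguf_header("model.gguf")
# print(version, tensors, kv_pairs)
```

A check like this only validates the container header; it says nothing about whether a particular runtime supports the model architecture stored inside.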
Multimodal
Transformers